Selection Analysis Identifies Clusters of Unusual Mutational Changes in Omicron Lineage BA.1 That Likely Impact Spike Function

Abstract Among the 30 nonsynonymous nucleotide substitutions in the Omicron S-gene are 13 that have only rarely been seen in other SARS-CoV-2 sequences. These mutations cluster within three functionally important regions of the S-gene at sites that will likely impact (1) interactions between subunits of the Spike trimer and the predisposition of subunits to shift from down to up configurations, (2) interactions of Spike with ACE2 receptors, and (3) the priming of Spike for membrane fusion. We show here that, based on both the rarity of these 13 mutations in intrapatient sequencing reads and patterns of selection at the codon sites where the mutations occur in SARS-CoV-2 and related sarbecoviruses, prior to the emergence of Omicron the mutations would have been predicted to decrease the fitness of any virus within which they occurred. We further propose that the mutations in each of the three clusters therefore cooperatively interact to both mitigate their individual fitness costs, and, in combination with other mutations, adaptively alter the function of Spike. Given the evident epidemic growth advantages of Omicron overall previously known SARS-CoV-2 lineages, it is crucial to determine both how such complex and highly adaptive mutation constellations were assembled within the Omicron S-gene, and why, despite unprecedented global genomic surveillance efforts, the early stages of this assembly process went completely undetected.


Introduction
The Omicron (B.1.1.529) SARS-CoV-2 variant of concern (VOC) identified in Southern Africa in late November 2021 (Viana et al. 2022) is the product of extensive evolution within an infection context that has so far yielded at least three genetically distinct viral lineages (BA.1, BA.2, and BA.3) since it diverged from an ancestral B.1.1 lineage (presumably at some time in mid to late 2020). Three possible explanations for the sudden appearance of Omicron without any prior detection of intermediate/progenitor forms before its discovery are: (1) SARS-CoV-2 genomic surveillance in the region where Omicron originated might have been inadequate to detect intermediate forms; (2) long-term evolution in one or more chronically infected people-similar to the proposed origin of lineages such as Alpha and C.1.2 (Rambaut et al. 2020;Scheepers et al. 2021;Cele, Karim, et al. 2022)-may have left intermediate forms unsampled within one or a few individual(s); and (3) reverse zoonosis to a nonhuman host, followed by undetected spread and diversification therein prior to spillover of some sublineages back into humans (Wei et al. 2021). At present, there is no strong evidence to support or reject any of these hypotheses on the origin of Omicron, but as new data are collected, its origin may be more precisely identified.
Regardless of the route that Omicron took to eventual community transmission, the genome of the BA.1 lineage that caused surges of infections globally in late 2021 and early 2022, accumulated 53 mutations relative to the Wuhan-Hu-1 reference strain, with 30 nonsynonymous substitutions in the Spike-encoding S-gene alone ( fig. 1). Here, we characterize the selective pressures that may have acted during the genesis of the BA.1 lineage and curate available data on the likely adaptive value of the BA.1 S-gene mutations. We were particularly interested in identifying BA.1 S-gene codon sites displaying evolutionary patterns that differed from those of other SARS-CoV-2 lineages (including the variation of SARS-CoV-2 in individual hosts), and closely related nonhuman sarbecoviruses. We use these comparisons to identify which BA.1 S-gene mutations might contribute to recently discovered shifts relative to other SARS-CoV-2 variants in the way that BA.1 interacts with human and animal ACE2 receptors and is primed by cellular proteases to mediate cellular entry (Cameroni et al. 2022;Gobeil et al. 2022;McCallum et al. 2022;Meng et al. 2022;Peacock et al. 2022;Willett et al. 2022). Our analysis identifies three clustered sets of mutations in the Spike protein, involving amino acid substitutions at 13 sites previously highly conserved across other SARS-CoV-2 lineages and other sarbecoviruses. The dramatic about-face in evolutionary dynamics at the 13 codon sites encoding these amino acids indicates that the mutations at these sites in BA.1 are likely interacting with one another, that the combined effects of these interactions are likely adaptive, and that these adaptations likely underlie at least some of the recently discovered shifts in BA.1 Spike function.
detectably evolving under positive selection when considering all SARS-CoV-2 genomic data prior to the discovery of Omicron (table 1 and fig. 2; https://observablehq.com/@ spond/selection-profile). For context, this fraction of positively selected sites (0.53) is approximately four times higher than the fraction of all SARS-CoV-2 S-gene sites that have ever shown any signals of positive selection (0.14).
The observed substitutions at 4 of these 16 sites (K417N , N501Y (Starr et al. 2020;Yuan et al. 2021;Zahradník et al. 2021), H655Y (Escalera et al. 2022), and P681H (Lubinski et al. 2022)) and a two nucleotide deletion at one additional site (Δ69-70 (Meng et al. 2021)) are among the 19 "501Y meta-signature" Spike mutations that are likely highly adaptive within the context of 501Y lineage viruses such as the Alpha, Beta, and Gamma VOCs . Given that the BA.1 mutations at these sites converge on those seen in these other VOCs, they are likely to be adaptive in BA.1 lineage viruses as well (sites colored red in fig. 3).
A further four BA.1 S-gene mutations are found in SARS-CoV-2 sequences belonging to other VOC lineages, and are either VOC lineage-defining mutations (majority mutations), or are lower-frequency mutations that have increased in frequency .2 fold between early and late VOC lineage circulation periods within sampled sequences belonging to these lineages (A67V in Alpha and Beta, T95I in Beta and Gamma, T478K in Beta, and N679K in Gamma; https://observablehq.com/@spond/sc2-selection-trends): an indication that these mutations too are likely adaptive in BA.1 lineage viruses (table 1). Additionally, three other BA.1 S-gene mutations either: (1) occur at the same codon sites as Alpha, Beta, Gamma, or Delta lineage-defining mutations but encode a different amino acid than these other lineages (E484A in BA.1 and E484K in Beta and Gamma); or (2) occur at the same codon sites as mutations in VOC lineages that increased in frequency .2 fold between early and late VOC lineage circulation periods but encode a different amino acid than these other lineages (N440K in BA.1 and N440S in Alpha; S477N in BA.1 and S477I in Beta and Gamma). Lastly, the S/D796Y mutation occurs at one of the four sites identified as potential locations of adaptation in human beta-coronaviruses via the analysis of convergent evolutionary patterns and functional impact (table 1; Escalera-Zamudio et al. 2021) and changes at this site (including D796Y) have previously been inferred to be potentially adaptive within the context of chronic SARS-CoV-2 infections Cele, Karim, et al. 2022). All of these BA.1 mutations likely have a substantial phenotypic impact (colored orange in fig. 3).
Finally, three deletions (Δ69-70, Δ143-145, and Δ211-212) and a nine nucleotide insertion (between codons 214 and 215) in the N-terminal domain encoding part of the S-gene all likely have phenotypic impacts and all are potentially adaptive but are not considered further here because they are not amenable to analysis by natural selection analysis methods that focus on patterns of synonymous and nonsynonymous mutations.

Clusters of BA.1 Mutations Occur at Neutral or Negatively Selected S-Gene Sites
The mutations occurring at the 14 BA.1 Spike codons which display either evidence of negative selection or no evidence of selection (neutral evolution) have rarely been seen within previously sampled sequences (bold rows in MBE omicron-mutations-tables) indicating the action of strong purifying selection due to functional constraints. Despite the rarity of these mutations in assembled genomes, it is not uncommon to find them in within-patient sequence data sets ( fig. 4), often at subconsensus allelic frequencies.
This indicates that, with the possible exceptions of S/ S371L, S/N764K, S/N856K, and S/Q954H, the mutations at these sites are not rare simply because they are unlikely to occur (note the sizes and numbers of dots corresponding to these mutations in fig. 4), but rather because whenever they do occur they are unlikely to either increase sufficiently in frequency to be transmitted (note the predominantly light orange/yellow colors of the dots corresponding to these mutations in fig. 4), or increase sufficiently in frequency among transmitting viruses to be detected by genomic surveillance. On their own, none of these 14 BA.1 mutations at codon sites that have previously been evolving either neutrally or under negative selection prior to November 2021 would be expected to provide SARS-CoV-2 with any selective advantage. If the BA.1 mutations observed at the ten negatively selected S-gene codon sites had occurred in the Wuhan-Hu-1 sequence, it is very likely that they would have been selected against. Specifically, since the start of the pandemic Spike proteins tended to function best whenever they had amino acids at these ten sites that were the same as those in the Spike encoded by the Wuhan-Hu-1 sequence.
It is clear that the amino acids encoded by 13 of the 14 mutated codon sites in the BA.1 S-gene that either show evidence of negative selection or no evidence of any selection, cluster within three regions of the Spike threedimensional structure (dark blue sites in fig. 3): 1) Cluster region 1 in the RBD (green sites in fig. 5): codons/amino acids S/339, S/371, S/373, and S/375; may be targeted by some class 4 neutralizing antibodies (Barnes et al. 2020). S/371L alone impacts, but probably does not provide escape from, binding of some antibodies in all four neutralizing antibody classes (Liu et al. 2022) suggesting that, in a Wuhan-Hu-1 genetic background, it may MBE substantially impact the glycosylation profile, trimerization, or balance of up-down protomer conformations of Spike (Gobeil et al. 2022). An S/S371F mutation, as occurs in BA.2, has previously been detected in the context of a chronic SARS-CoV-2 infection (Maponga et al. 2022). 2) Cluster region 2 in the receptor-binding motif (cyan sites in fig. 5 Pond et al. 2020). This method was also used in Martin et al. (2021). Red circles show sites under positive selection (selection favoring changes at amino acid states encoded at these sites). Blue circles show sites under negative selection (selection against nonsynonymous changes). When no circle is shown, the corresponding site offered no statistical evidence for nonneutral evolution at a given time point. The areas of circles indicate the statistical strength of the selection signal (and not the actual strength of selection) within sequences sampled in the 3 months preceding the first day of the indicated months. Note that none of these analyses included any Omicron sequences, hence selection signals are derived solely from other SARS-CoV-2 lineages.
Selection Analysis Identifies Clusters of Unusual Mutational Changes · https://doi.org/10.1093/molbev/msac061 MBE 498, and S/505. This region is known to be targeted by class 1 and class 2 neutralizing antibodies . S/493 is, in fact, a known target of such antibodies. Accordingly, S/Q493R (as occurs in BA.1) escapes some class 2 neutralizing antibodies (Liu et al. 2022), and S/Q493R and S/Q493K escape mutations have been selected in VSV in vitro experiments (Weisblum et al. 2020). S/Q493K, S/G496S, S/ Q498R, and S/Y505H mutations have also all arisen previously in the context of chronic SARS-CoV-2 infections (Choi et al. 2020;Maponga et al. 2022;Riddell et al. 2022;Wilkinson et al. 2022). The S/ Q498R and S/Q493R mutations yield two additional salt bridges when binding human ACE2 (Mannar et al. 2022;McCallum et al. 2022) and it is likely that the increased affinity of BA.1 Spike for human ACE2 relative to that of Alpha, Beta, Delta, and Wuhan-Hu-1 (Cameroni et al. 2022;Meng et al. 2022;Peacock et al. 2022) will further decrease its sensitivity to neutralization. 3) Cluster region 3 in the fusion domain (yellow sites in fig. 5): codons/amino acids S/764, S/856, S/954, S/ 969, and S/981; a region of Spike currently not known to be targeted by neutralizing antibodies. The S/N764K, S/N856K, and S/N969K mutations are likely to enhance interactions between the S1 and S2 subunits of the BA.1 Spike and are likely to contribute to reduced S1 shedding following proteolytic cleavage of the polybasic S1/S2 site (Zeng et al. 2021;McCallum et al. 2022).
The 14th BA.1 S-gene mutation site that has been evolving under negative selection in non-BA.1 SARS-CoV-2 FIG. 3. Distribution of BA.1 amino acid replacements on the three-dimensional SARS-CoV-2 Spike trimer. In this rendering of the trimer, one protomer subunit is shown in the "up" or "open" configuration while interacting with human ACE2 (Xu et al. 2021). The other two subunits are in the "down" or "closed" configurations. Amino acids are color coded according to their likely contribution to viral adaptation in a Wuhan-Hu-1-like genetic background based on (1) patterns of synonymous and nonsynonymous substitutions at the codons encoding these amino acids in non-Omicron sequences, (2) patterns of mutational convergence between viruses in different VOCs, and (3) increases in the frequency over time of VOC sublineages encoding amino acids that match those found in BA.1. NTD, N-terminal domain, RBD, Receptor-binding domain; RBM, receptor-binding motif. Locations of sites in the three clusters of BA.1 mutations that are rarely seen and fall at either negatively selected (dark blue) and neutrally evolving (light blue) sites. An interactive version of this figure can be found here: https://observablehq.com/@stephenshank/sars-cov-2-ace2-protein-interaction-and-evolution-for-omicr.
Martin et al. · https://doi.org/10.1093/molbev/msac061 MBE lineages, S/212, is within the N-terminal domain (dark blue site in the region marked "NTD" in fig. 3). In BA.1 sequences, S/212 is bordered by a deletion at S/211 and an insertion of three amino acids after S/214. These adjacent mutations substantially alter the structural context of S/ 212 and it is expected that the selective regime under which this site is evolving in BA.1 will have also been altered. It would, therefore, be unsurprising if amino acid substitutions at this site were not as maladaptive in BA.1 as they likely are in other lineages.
It is also noteworthy that several mutations in each of the three cluster regions differ between BA.1 and its sister lineage, BA.2: In cluster region 1, BA.2 has a S/S371L mutation instead of a S/S371F, in cluster region 2, BA.2 is missing the S/G496S mutation, and in cluster region 3, BA.2 is missing the S/N856K and S/L891F mutations. These "missing" FIG. 4. Intrapatient allelic variation seen at BA.1 amino acid mutation sites in a subset of SARS-CoV-2 raw sequencing data since March 2020 analyzed using a standardized variant calling pipeline (Maier et al. 2021). The areas of the circles indicate the proportions of raw sequence data sets (per 1,000 samples) where a mutation away from the Wuhan-Hu-1 consensus sequence was called. The color of the circle indicates the median intrapatient allele frequency (AF) in data sets for which each mutation was detected. Mutations occurring at lower AFs are only present in a subpopulation of viruses in a particular host. The data have been generated by calling variants from read-level data of 230,506 samples from COG-UK, Estonia, Greece, Ireland, and South Africa: PRJEB37886, PRJEB42961 (and multiple other bioprojects with the study title: Whole-genome sequencing of SARS-CoV-2 from Covid-19 patients from Estonia), PRJEB44141, PRJEB40277, and PRJNA636748. Note that S371L is the result of two nucleotide substitutions in codon S/371 and was never detected in intrapatient samples. S371F represents an intermediate mutation between the Wuhan-Hu-1 state and that of BA.1 and is presented here for completeness. Selection Analysis Identifies Clusters of Unusual Mutational Changes · https://doi.org/10.1093/molbev/msac061 MBE mutations in BA.2 were likely the last of the cluster region mutations to be acquired by the BA.1 progenitor, following its split from the BA.2 lineage.

Selection Patterns in Sarbecoviruses Confirm That, on Their Own, Many BA.1 Mutations Would Likely Be Deleterious
To determine whether patterns of selection at the Omicron/BA.1-specific sites are broadly consistent with those occurring in the horseshoe bat-infecting SARS-related coronaviruses (in the Sarbecovirus subgenus to which SARS-CoV-2 belongs), we examined patterns of synonymous and nonsynonymous substitutions in 167 publicly available Sarbecovirus genomes. Accounting for recombination, we tested for selection signatures at all 44 codons encoding amino acids that differ between Wuhan-Hu-1 and BA.1 (https://observablehq.com/@ spond/ncos-evolution-nov-2021). We specifically focused the analyses on selection signals in the subset of FIG. 5. Positions on the three-dimensional SARS-CoV-2 Spike trimer of amino acids encoded by three clusters of BA.1 codon sites that are evolving either neutrally or under negative selection in non-Omicron SARS-CoV-2 sequences. The Spike protomer subunit interacting with human ACE2 is in the "up" configuration and the other two are in the "down" configuration (Xu et al. 2021). The cluster region 1 and 2 encoded amino acid changes in BA.1 (in green and blue, respectively) are within the receptor-binding domain of Spike with the cluster 2 encoded changes located within the receptor-binding motif. The cluster region 3 mutations are within the fusion domain of Spike. An interactive version of this figure can be found at https://observablehq.com/@stephenshank/sc2-omicron-clusters.
Martin et al. · https://doi.org/10.1093/molbev/msac061 MBE sarbecoviruses that are more closely related to SARS-CoV-2 in each recombination-free part of their genome: a group of sequences we refer to as the nCoV clade (Lytras et al. 2022). Depending on the recombination-free genome region being considered, this clade was represented by a range of between 15 and 27 sequences. We refer to the remaining sarbecoviruses as the non-nCoV sequences.
Of the 44 codon sites considered, 26 are detectably evolving under negative selection (FEL P-value ,0.05; Kosakovsky Pond and Frost 2005;Kosakovsky Pond et al. 2020) and one (S/417) under positive selection (MEME P-value ,0.05; Murrell et al. 2012) in the nCoV clade. This positive selection signal at S/417 reflects an encoded amino acid change from an ancestral V that is present in all background sequences, to a K that is specific to the nCoV clade. A K is also encoded at this site in Wuhan-Hu-1 but has since changed multiple times in various SARS-CoV-2 lineages: for example, to an N during the genesis of lineages such as Omicron and Beta and to a T during the genesis of the Gamma lineage.
We were, however, particularly interested in whether the cluster 1, 2, and 3 mutation sites in the S-gene were also evolving in a constrained manner (i.e., under negative selection) in the nCoV clade and, if so, what the selectively favored encoded amino acid states were at these sites. Consistent with the hypothesis that the Wuhan-Hu-1 encoded amino acid states are generally constrained in the closest known SARS-CoV-2 relatives, the cluster 1 sites S/339, S/373, and S/375, the cluster 2 site S/505 and the cluster 3 sites S/764, S/856, S/969, and S/981 were all detectably evolving under negative selection in the nCoV clade viruses with the Wuhan-Hu-1 encoded amino acid state being favored at all eight of the sites. Also consistent with the hypothesis, two of the remaining five sites across the clusters that were not detectably evolving under negative selection in the nCoV clade (S/371 and S/954) predominantly encoded the Wuhan-Hu-1 amino acid state in all sarbecoviruses. Only the cluster 2 sites S/493, S/ 496, and S/498 seem to vary substantially across the Sarbecovirus subgenus.
What Can the Sarbecoviruses Tell Us About the Biological Consequences of the Rarely Seen BA.1 Mutations?
Despite the observation that, even among sarbecoviruses, BA.1 mutations seen in cluster regions 1, 2, and 3 are only rarely seen, the instances where they do occur might be illuminating. For example, among the bat-infecting sarbecoviruses, the BA.1 S/G339D substitution (in cluster region 1) has primarily to date been found among the bat-infecting viruses within a clade that does not use ACE2 as a cell entry receptor ( fig. 6; Starr, Zepeda, et al. 2022). The change in receptor-binding function in these viruses is, however, most likely due to two receptor- FIG. 6. Phylogenetic trees of 167 sarbecoviruses indicating patterns of selection at S-gene codons S/339 (left tree), S/493 (middle tree), and S/505 (right tree). Branches along which amino acid states have changed are indicated with thick lines. Dashed lines represent long branches that have been shortened for visual clarity. The highlighted segments of the middle and right trees indicate the branch across which S/N493R and S/Y505H mutations occurred. The trees represent evolutionary relationships between putatively nonrecombinant sequence fragments in the genome region corresponding to Wuhan-Hu-1 Spike positions 324-654. The clade containing sarbecoviruses sampled in Europe and Africa has been used as the outgroup for rooting. Tree tips are annotated by amino acid states at the respective sites. SARS-CoV-2 is annotated with a green tip symbol and the nCoV clade sequences with a tip symbol in orange.
Selection Analysis Identifies Clusters of Unusual Mutational Changes · https://doi.org/10.1093/molbev/msac061 MBE binding motif deletions that are also specific to this clade. Furthermore, cluster region 1 codon sites S/371, S/373, and S/375 encode a conserved serine (S) in almost all the analyzed sarbecoviruses (164/167, 165/167, and 167/167, respectively). The change at sites S/371 and S/375 from an encoded polar residue (S) to a hydrophobic residue (an L at S/371 and an F at S/375) implies a substantial change in the biochemical properties of this region of Spike that has never before been seen in any sarbecovirus. These changes could be associated with SARS-CoV-2's unique loss of the N370 glycosylation site relative to all other sarbecoviruses (Kang et al. 2021), or packing of this surface with other BA.1 changes in cluster 2 in locked or closed Spike trimer structures (e.g., 3.4 Å between S/S373P and S/Y505H in PDB 7TF8; Gobeil et al. 2022).
As with SARS-CoV-2, the amino acids encoded at cluster region 2 sites (all of which fall within the receptorbinding motif) vary substantially between different sarbecoviruses but without any associated signals of positive selection at these sites within the nCoV clade. Notably, the same BA.1 encoded amino acids at codon S/493R and S/ 505H also co-occur in a clade of sarbecoviruses that are closely related to SARS-CoV (GenBank accessions: KY417144, OK017858, KY417146, OK017852, OK017855, OK017853, OK017854, OK017856, and OK017857); although S/493R (AY613951 and AY613948) and S/505H (MN996532 and LC556375) can also occur independently. Besides the various Omicron sublineages, S/493R and S/ 505H are not found as a pair in any other SARS-CoV-2 sequences. These mutations occurring along the same branch of the sarbecovirus tree ( fig. 6) suggest that, rather than favoring changes at the sites individually, selection may favor simultaneous changes to S/493R and S/505H due to these residues together having a greater combined fitness benefit than the sum of their individual effects: a type of interaction between genome sites referred to as positive epistasis.
The region 3 cluster sites are conserved across the sarbecoviruses with almost all known viruses having the same residues at these sites as the Wuhan-Hu-1 SARS-CoV-2 strain. This supports the hypothesis that, when considered individually, the mutations seen at these fusion domain sites in BA.1 are likely to be maladaptive.

BA.1 Mutations at Neutral or Negatively Selected S-Gene Sites Might Only Be Adaptive When They Co-Occur
Given both the apparent selective constraints on mutations arising at the cluster region 1, 2, and 3 sites in SARS-CoV-2 and other sarbecoviruses, and the rarity of observed mutations at these sites among the millions of assembled SARS-CoV-2 genomes (despite evidence that individually such mutations do regularly occur during within-host evolution; fig. 4), it is very likely that BA.1 mutations at cluster region 1, 2, and 3 sites are maladaptive when present on their own. Nevertheless, the presence of mutations at these sites in BA.1, a lineage of viruses that is clearly highly adapted, suggests that these mutations might interact with one another such that, when present together, they become adaptive. Therefore, while individually the mutations might decrease the fitness of any genome in which they occur, collectively they might compensate for one another's deficits to yield a fitter virus genotype under certain conditions: such as prolonged infections, transmission in a nonhuman species, or a combination of these.
Positive epistasis of this type has, in fact, already been demonstrated between the cluster 2 mutations, S/Q498R and S/G496S, and the pivotal mutation of the 501Y SARS-CoV-2 lineages, S/N501Y . For example, whereas S/498R only marginally impacts the affinity of Spike for human ACE2 when present with S/501N (Starr et al. 2020), it strongly increases ACE2 binding affinity when present with S/501Y (Bate et al. 2021;Zahradník et al. 2021;. Structural analyses of inter-protomer interactions within the Spike trimer of SARS-CoV-2 and other sarbecoviruses imply that epistasis likely occurs among and between some cluster 1 and cluster 2 mutation sites. Specifically, in the genetic context of Wuhan-Hu-1, S/ 371S and S/373S (cluster 1) of one Spike protomer within a trimer, are likely to interact via hydrogen bonds with S/ 493Q and S/505Y (cluster 2) of an adjacent protomer in the trimer when Spike is in its down configuration ( fig. 5; Wrobel et al. 2020). In BA.1, the S/S371L, S/S373P and S375F cluster 1 mutations, and the S/Y505H cluster 2 mutation appear to strengthen these interactions, decreasing the flexibility of RBD within the trimer and stabilizing the down configuration of Spike (Gobeil et al. 2022).
If mutations in the three cluster regions do epistatically interact with one another, then one might expect that selection would favor their co-occurrence either within individual SARS-CoV-2 genome sequences that have so far been sampled, or as minor variants within unassembled intrapatient sequence data. We failed to detect such associations in any systematic manner ( fig. 7). While there are individual pairs of BA.1 mutations that co-occur more frequently than expected by chance (e.g., 440K in the presence of 95I), they do not involve cluster 1, 2, and 3 mutations. Furthermore, many of the BA.1 mutation pairs occur together less frequently than expected by chance (e.g., 478K and 501Y). Rather than reflecting an absence of epistasis between the cluster 1, 2, and 3 mutation sites, our failure to detect the co-occurrence of Omicron mutation pairs at these sites simply reflects the rarity of these mutations within both assembled SARS-CoV-2 genome sequences (table 1) and raw intrapatient sequence data sets ( fig. 4) MBE mutations might show evidence of coevolution during the ongoing diversification of the BA.1 lineage. We, therefore, tested the 135,247 BA.1 annotated S-gene sequences that were available in GISAID (Elbe and Buckland-Merrett 2017) as of January 5, 2022 for evidence that any of the 630 site pairs with sufficient evolutionary signal (at least two nonsynonymous substitutions along internal branches of a subsampled tree of genetically unique S-gene sequences) were coevolving using a Bayesian graphical model method (Poon et al. 2007).
Using a Bayesian MCMC inference approach, we found six pairs of sites to be coevolving with posterior probability (PP) ≥ 0.9 ( fig. 8). Two sites in cluster 1 (S/371 and S/375) share substitutions along three internal tree branches (in all cases reversions to Wuhan-Hu-1 S residues at both sites) with the LF → SS reversion pair at these sites having a co-occurrence log-odds (LOD) of 6.5. In cluster 2, S/493 co-evolves with S/496 and S/498; in both cases substitutions along two internal branches are shared, and in both cases these substitutions are reversions to Wuhan-Hu-1 residues (RS → QG; LOD = 6.6 and RR → QQ, LOD = 6.4). One of the two branches involves the reversion of all three residues. Two sites in cluster 3, S/856 and S/954, are detectably coevolving in that they share a KH → NQ substitution pair along one internal tree branch (LOD = 8.2). It is noteworthy, however, that whereas S/954 is mutated in BA.2, S/856 is not. Given that BA.2 is likely more transmissible than BA.1 in at least some situations (Lyngse et al. 2022), changes at S/954 are not obviously maladaptive in the absence of changes at S/856. It, therefore, remains unclear why these sites might be coevolving in BA.1.
We recommend a high degree of caution when interpreting the results of these coevolution analyses. Although the detected coevolution between sites in the three cluster regions supports the hypothesis that at least some mutations within each of the cluster regions are epistatically interacting with one another, it is concerning that these signals of coevolution are exclusively driven by reversion mutations.

FIG. 7.
Patterns of co-occurrence of BA.1 amino acid residues in circulating SARS-CoV-2 S-gene haplotypes from other lineages (data up to October 15, 2021). Only mutations occurring in at least 10 haplotypes are shown. All sequences having exactly the same S-gene sequence count as a single unique haplotype; instead of counting raw sequence numbers, this approach focuses on the number of unique genetic backgrounds in which pairs of codons co-occur. Circles show odds ratios for finding the mutation on the X-axis when the mutation on the Y-axis is also present (vs. when it is not present). Red circles depict odds ratio (OR) . 1, while blue circles 1/OR for OR , 1. Black circles on the right show the fraction of globally sampled SARS-CoV-2 S-gene haplotypes that carry the corresponding mutation.
Selection Analysis Identifies Clusters of Unusual Mutational Changes · https://doi.org/10.1093/molbev/msac061 MBE Despite restricting our analysis only to mutations mapping to the internal branches of the BA.1 phylogenetic tree-a precaution intended to minimize the impact of sequencing errors or intrahost signals of selection-it is plausible that BA.1 sequencing errors are so pervasive (Arctic Network) that at least some of these errors might have been incorrectly mapped to internal phylogenetic tree branches. We discuss the issue of spurious reversion mutations in more detail below.

How Might Mutations in the Three Cluster Regions Impact Spike Function?
Regardless of if epistasis is operating between mutations within and/or between the three cluster regions, the amino acid changes caused by these and other S-gene mutations likely represent a substantial remodeling of two functionally important components of the BA.1 Spike: the fusion domain (Gobeil et al. 2022) and the receptorbinding domain (Gobeil et al. 2022;McCallum et al. 2022).
The cluster region 3 encoded amino acid changes in the part of Spike that is responsible for membrane fusion suggest that the membrane fusion machinery of the BA.1 Spike may have been overhauled. These modifications likely contribute to observed increases relative to other VOCs in the structural flexibility of the portion of Spike surrounding the fusion peptide ( fig. 1) and likely expedite exposure and release of the peptide during the initiation of cell fusion (Gobeil et al. 2022). The structural consequences of the cluster region 3 mutations might additionally contribute to reduced TMPRSS2-mediated cleavage relative to Delta of BA.1 Spike at the polybasic S1/S2 cleavage site (Meng et al. 2022), reduced sensitivity to endosomal restriction factors (such as IFITM proteins) (Peacock et al. 2022), and a shift in the preferred route of cellular entry from surface to endosomal (Meng et al. 2022;Peacock et al. 2022;Willett et al. 2022): functionally important changes collectively resulting in a reduction relative to other SARS-CoV-2 lineages in the reliance of BA.1 on TMPRSS2 for cellular entry, a broadened cellular tropism, and a reduced propensity for infected cells to form syncytia (Meng et al. 2022;Peacock et al. 2022).
The mutations in cluster regions 1 and 2 fall within the receptor-binding domain (RBD) encoding part of the S-gene. These mutations, together with those at S/417, S/ 440, and S/446, underlie an extensive remodeling of the ACE2 receptor-binding surface (Mannar et al. 2022;McCallum et al. 2022); accommodating major changes in the way that Spike interacts with the ACE2 of humans and other animals (Cameroni et al. 2022;Peacock et al. 2022).
Of the cluster 2 sites, all of which fall within the receptor-binding motif encoding part of the RBD, only S/ 498 and S/505 show signs of the Wuhan-Hu-1 encoded amino acid state having been selectively favored in the past (S/498 in SARS-CoV-2 and S/505 in nCoV). No signs of any positive selection at the other cluster 2 sites in SARS-CoV-2 implies that changes at these and the negatively selected site in cluster 2 have likely not individually  fig. 9; ) have found little evidence that individual substitutions at S/505 have antigenic effects, whereas mutations at S/496R and S/498R have only moderate antigenic effects; similar to those of the 501Y mutation. The exception that proves the rule that sites in this region might not be free to change in response to immune pressures is S/493R. Given that S/493R has a strong antigenic effect, if it was not under selective constraints to sustain optimal degrees of ACE2 interaction (Starr, Zepeda, et al. 2022), it should (but does not) display at least intermittently detectable signs of positive selection.
It is, however, plausible that the cluster region 1 and 2 mutations are collectively immune evasive in that the extraordinary resistance of the BA.1 Spike to neutralization is likely at least partially attributable to its stabilization in the down configuration by the S/S371L, S/S373P, S/S375F, and S/Y505H mutations (Gobeil et al. 2022): a conformation that blocks the binding of neutralizing antibodies that target the RBD. Whereas natural selection appears to have previously favored SARS-CoV-2 variants with S-gene mutations such as S/D614G that destabilized the down configuration of Spike (since these increased the probability of successful Spike-ACE2 engagements) (Berger and Schaffitzel 2020), rising population immunity has now potentially shifted the selective environment in favor of mutations that stabilize the down configuration.
It is also noteworthy that, together with the BA.1 NTD deletion mutations, the S/214 EPA insertion, and other RBD mutations, the cluster 1 and 2 mutations have likely altered the conformational dynamics of the N2R linker region of Spike between the N-terminal domain and RBD such that one protomer in a trimer has an enhanced predisposition to transition into the up configuration: a mechanism that may enable efficient engagement of the BA.1 receptor-binding motif with ACE2 despite the increased stability of its Spike trimers when all their protomers are in the down configuration (Gobeil et al. 2022).
How Were the Cluster Region 1, 2, and 3 Mutations Assembled Within Omicron?
Given the manifest viability of BA.1 and the other Omicron sublineages there is a pressing need to understand how and why they accumulated so many mutations that, on their own at least, are apparently either selectively neutral or maladaptive. The genetic distance between the Omicron sublineages and their nearest known SARS-CoV-2 relatives implies that the Omicron progenitor accumulated its unprecedented number of mutations during an extensive period of undetected replication. When accurate molecular clock estimates are obtained of both the time when Omicron last shared a common ancestor with other SARS-CoV-2 lineages, and the time when all the detected Omicron sublineages last shared a common ancestor, we will have upper and lower bounds on the amount of time it took for Omicron to assemble its complement of mutations.
The Omicron progenitor could have spent this period of intensive or prolonged evolution in a region that carries out minimal genomic surveillance and/or where access to, or utilization of, healthcare resources is low (the surveillance failure hypothesis). Alternatively, this viral evolution could have taken place within a long-term infection (or possibly serial long-term infections; the chronic infection hypothesis), or during spread within a nonhuman host population (the reverse zoonosis hypothesis). Combinations of these evolutionary modes are also a possibility. We will only be able to distinguish between these hypotheses with more data.
Currently, the simple existence of three distinct Omicron lineages best supports the surveillance failure hypothesis at least for the latter stages of Omicron evolution following the divergence of the BA.1, BA.2, and BA.3 lineages from their most recent common ancestor. However, if similarly divergent SARS-CoV-2 variants are discovered together in either long-term human infections or co-circulating in other animal species, these would support the other hypotheses.
FIG. 9. Experimentally measured effects of RBD mutations on binding of monoclonal antibodies at sites that differ between the BA.1 lineage viruses and Wuhan-Hu-1. The line plot shows antibody binding escape measured by deep mutational scanning of the Wuhan-Hu-1 RBD , averaged across 36 monoclonal antibodies (8 class 1, 13 class 2, 7 class 3, and 8 class 4 antibodies). Sites that are mutated in the BA.1 relative to Wuhan-Hu-1 are indicated and colored according to the predicted antigenic effect of mutations at that site (strong, moderate, or minimal). An interactive version of this plot is available at https://jbloomlab.github.io/SARS2_RBD_Ab_escape_maps/. Selection Analysis Identifies Clusters of Unusual Mutational Changes · https://doi.org/10.1093/molbev/msac061 MBE Relative to evolution during normal SARS-CoV-2 person-to-person transmission, evolution within the context of either long-term infections or an alternative animal host could potentially have occurred at an accelerated pace Lu et al. 2021). In the context of either chronic infections of immunosuppressed individuals (Choi et al. 2020;Kemp et al. 2021;Cele, Karim, et al. 2022), or animals that naturally sustain long-term SARS-CoV-2 infections (such as may be the case for white-tailed deer given both the extraordinarily high frequencies of ongoing SARS-CoV-2 infections discovered in this species (Hale et al. 2022;Kuchipudi et al. 2022) and extensively evolved variants sampled from some of these animals (Pickering et al. 2022)), purifying selection may have been relaxed somewhat relative to that occurring during normal human-to-human transmission: enough so for genomes carrying suboptimal combinations of epistatically interacting mutations to remain viable while fitter combinations were discovered via additional mutations and genetic recombination. In addition, chronic infections are not impacted by the tight transmission bottlenecks that can stochastically purge nascent adaptive mutations during normal transmission (Braun et al. 2021;Lythgoe et al. 2021).
Sequential cycles of immune surveillance and viral immune escape within a long-term infection could also potentially explain the mutation clusters without the need to invoke compensatory epistatic interactions between mutations. Specifically, the clustered mutation patterns in the Spike proteins of BA.1 and other Omicron sublineages are reminiscent of those seen in the HIV envelope protein as a consequence of sequentially acquired virus mutations that evade the progressively broadening neutralization potential of a maturing antibody lineage (Landais et al. 2017). While signs of negative selection at 9/13 of the mutated codons in the three cluster regions of Omicron are not entirely consistent with this hypothesis, the overwhelming contributor to these negative selection signals are the selective processes operating during normal short-term SARS-CoV-2 infections where the antibody-pathogen dynamics simply do not have time to develop. It is possible that if purifying selection is relaxed at these sites during unusually prolonged infections, then neutralizing antibody evasion mutations might be tolerated. Even if purifying selection were not relaxed, however, during a chronic infection the potential long-term fitness costs that are incurred by highly effective immune evasion mutations might frequently be offset by the immediate fitness benefits of evading neutralization.

It Remains Unclear Whether Mutations in Cluster
Regions 1, 2, and 3 are Showing Signs of Reversion Whatever the process that yielded the three clusters of rarely seen mutations in the Omicron progenitor, now that it is being transmitted among people, any deleterious immune evasion mutations it has accumulated might be substantially less tolerable. Likewise, some of the mutations it may have accumulated during its adaptation to transmission in an alternative animal species would now also potentially be somewhat maladaptive. If the rarely seen mutations at negatively selected sites in the RBD of BA.1 lineage viruses that are known to be targeted by neutralizing antibodies have begun reverting since BA.1 emerged, it would best support the chronic infection hypothesis in that such reversions would imply a trade-off between intrahost replicative and/or movement fitness and immune evasion. Alternatively, if reversion mutations have occurred at BA.1 lineage virus receptor-binding motif sites that are known to impact human ACE2 binding but which have minor antigenic impacts, this would better support the reverse zoonosis hypothesis. Comparative evolutionary analyses focused on the BA.1 subclade of the SARS-CoV-2 phylogenetic tree revealed signatures of positive diversifying selection at 20 of the 28 S-gene codon sites that contain BA.1 lineage-defining mutations (table 2, bold, deletions/insertions were not considered). Strong evidence of positive selection (FEL P , 0.001) was also detectable at several codon sites of the S-gene that do not contain BA.1 lineage-defining mutations; most notably S/346 (R→K), S/452 (L→R), and S/701 (A→V). Amino acid changes encoded at all three of these codons are likely adaptive with S/R346K and S/L452R likely providing moderate degrees of escape from neutralizing antibodies , and S/A701V likely enhancing cleavage of Spike at the S1/S2 site (Escalera et al. 2022).
We found no molecular evidence for negative selection at any sites. At all sites, the vast majority of changes, measured either as fractions in all consensus genomes, or substitutions along internal branches of the phylogenetic tree Association between the proportion of sequences with missing data at a BA.1 mutation site and the number of reversion mutations seen at that site. This significant association between missing data and reversion mutation counts (dotted blue trendline with Pearson's R 2 = 0.773; P , 0.01) is likely attributable to miscalled nucleotides at BA.1 mutation sites whenever read coverage is low during sequencing. Under conditions when PCR/sequencing primers are not optimal for the amplification of BA.1 sequence, non-BA.1 SARS-CoV-2 genetic material contaminating sequencing instruments and other laboratory equipment used for sample preparation will occasionally yield more amplions/sequence reads than those from the intended BA.1 target sequences. Wherever the nucleotide states of these contaminant amplicons are different from those of the intended BA.1 target, they will frequently yield base miscalls during sequence assembly that, if the miscalled base corresponds with an ancestral state, will be misinterpreted as reversion mutations. Compared to BA.1 lineage-defining mutations in the S-gene at codon sites that are positively selected (red dots), the 13 mutations at negatively selected or neutrally evolving cluster region 1, 2, and 3 sites (blue dots) actually have a lower than average number of detectable reversion mutations (note how the blue dots predominantly fall below the blue trend line). Only one of these 13 mutations (at codon S/339) has a number of reversions that might be higher than expected given the percentage missing data for the codons where the mutations occur.
Selection Analysis Identifies Clusters of Unusual Mutational Changes · https://doi.org/10.1093/molbev/msac061 MBE of representative sequences, involve reversions to Wuhan-Hu-1 amino acid states. At all sites, a fraction of sampled genomes have missing data (fully or partially unresolved nucleotides; table 2). For key sites in RBD, this fraction is very high and, crucially, there is a strong correlation (R 2 = 0.773) between the percentage missing data at a site and the number of reversion mutations inferred at that site ( fig. 10). When multiplexing multiple samples in single sequencing runs, it is likely that known primer dropout issues for BA.1 sequences (Arctic Network) can result in the amplification of environmental SARS-CoV-2 nucleic acid templates (e.g., from Delta lineages) that contaminate sample preparation laboratories and sequencing devices. When sequence reads derived from these contaminating templates are amplified to a similar degree to (or a greater degree than) BA.1 templates for a given region and are then used to assign nucleotide states in assembled genomes, apparent reversion mutations could result.
It is, therefore, unsurprising that for every BA.1 mutation in cluster regions 1, 2, and 3, we found multiple instances of reversions occurring along internal tree branches (a mean of 4.7 reversions per site; computed over internal tree branches in the reduced haplotype tree; table 2). However, we noted that this pattern was also apparent for all of the other BA.1 spike mutations (a mean of 4.3 reversions per site): particularly so for the 15 BA.1 mutations falling within the RBD (mean of 5.3 for the cluster region 1 and 2 sites and 9.1 for the other sites). Furthermore, of the 144 reversion mutations found across all of the S-gene, 13 (9.0%) were within clusters of three to four contiguous mutations: a degree of clustering that is significantly higher than would be expected for random independent mutations (permutation P-value ,0.001; fig. 10). This pattern would, however, be expected with the widespread use of sequencing primers that are poorly suited to BA.1 sequencing.
When we account for the association between sequence coverage and reversion mutation counts, it is apparent that in the S-gene we do not see more reversion mutations at cluster region 1, 2, and 3 codon sites than at other BA.1 lineage-defining mutation sites ( fig. 10). It, therefore, follows that, by this metric, the cluster 1, 2, and 3 mutations are, with the possible exception of S/G339D ( fig. 10), not obviously less adaptive during the ongoing diversification of BA.1 than are other S-gene BA.1 lineage-defining mutations.
Despite not supporting one origin hypothesis over another, our inability to convincingly demonstrate unusually frequent reversions of cluster region 1, 2, and 3 mutations, remains consistent with the hypothesis that these mutations are broadly adaptive when they occur in the combinations found in BA.1 lineage viruses.

Conclusion
Regardless of how the complement of mutations in the three cluster regions was assembled, their presence in BA.1 together with indirect evidence that the mutations are epistatically interacting is concerning. As with the concomitant emergence of the Alpha, Beta, and Gamma VOCs in late 2020, part of the reason that the emergence of Omicron was a surprise is that the evolvability of SARS-CoV-2 is still deeply under-appreciated. It is becoming increasingly apparent that the evolutionary processes that yielded BA.1 involved balancing multiple fitness trade-offs: (1) between immune escape (Sheward et al. 2021;Cameroni et al. 2022;Cele, Jackson, et al. 2022;Liu et al. 2022;Willett et al. 2022) and affinity for human and/or animal ACE2 proteins (Cameroni et al. 2022;Gobeil et al. 2022;Mannar et al. 2022;McCallum et al. 2022;Meng et al. 2022;Peacock et al. 2022); (2) between efficient proteolytic priming with TMPRSS2 which expedites cellular entry via the cell surface (Meng et al. 2022;Peacock et al. 2022;Willett et al. 2022) and increased resistance to endosomal restriction factors (such as IFITM proteins) which enable more efficient cellular entry via the endocytic route (Peacock et al. 2022); (3) between preferred tropism for cells in the upper respiratory tract and preferred tropism for cells in the lower respiratory tract (Meng et al. 2022;Peacock et al. 2022); and (4) between increased propensity for Spike protomers to switch to the up configuration for ACE2 engagement and increased stabilization of the down configuration to prevent binding of neutralizing antibodies (Wrobel et al. 2020;Miller et al. 2021;Zimmerman et al. 2021;Gobeil et al. 2022). Fortunately, the collection of mutations in BA.1 appear at present to have tilted the balance of these and other tradeoffs toward the virus having decreased clinical severity in humans (Jassat et al. 2021;Davies et al. 2022).
It remains unclear what roles epistatic interactions between the BA.1 S-gene cluster region 1, 2, and 3 mutations have played in resolving these trade-offs. It is evident, however, that the extensive mutational changes in BA.1 that have collectively yielded these resolutions are as similar to "normal" stepwise mutational changes seen in previous variants as antigenic shifts are to antigenic drifts (van der Straten et al. 2022). The evolutionary dynamics of the clustered rarely seen mutations in the RBD and fusion domains of BA.1 lineage viruses suggest that-rather than merely supporting minor tweaks in the antigenicity of Spike, its ACE2 binding affinity or its membrane fusion properties -these mutations are likely pivotal to the big observed shifts in how BA.1 Spike proteins function.
While a threat in its own right, BA.1 is also a warning. It demonstrates that complex evolutionary remodeling of important functional elements of SARS-CoV-2 are not just possible, but are potentially already occurring unnoticed in other poorly sampled lineages. We should not complacently assume that the balance of fitness trade-offs achieved by the extensively evolved VOCs that succeed BA.1 will be similarly tilted toward lower severity.

Global Analyses of Selection
Unless specified otherwise, all analyses were performed on single gene (e.g., S) or peptide products (e.g., nsp3), since Martin et al. · https://doi.org/10.1093/molbev/msac061 MBE genes/peptides are the targets of selection. Global SARS-CoV-2 gene/peptide data sets were compiled (from GISAID; Elbe and Buckland-Merrett 2017), processed and analyzed at monthly intervals for evidence of selection acting on individual codon sites as in Martin et al. (2021). Results of these analyses at codons where Omicron mutations occur can be visualized using an Observable notebook at https:// observablehq.com/@spond/sars-cov-2-selected-sites.

Analyses of Intrapatient SARS-CoV-2 Diversity
Intrahost allelic variation seen at BA.1 amino acid mutation sites was analyzed in 282,788 annotated (i.e., with detailed associated metadata) publically available SARS-CoV-2 raw sequencing data sets from the UK, Greece, Estonia, Ireland, and South Africa between March 2020 and September 2021 all of which were processed and analyzed using the standardized variant calling pipeline described in Maier et al. (2021). All variant calling data for genomic sites where BA.1, 2, and 3 lineage-defining mutations occur were extracted from processed data sets available via ftp:// xfer13.crg.eu/ and https://covid19.galaxyproject.org/ genomics/global_platform/#processed-cog-uk-data can be explored using the observable notebook at https:// observablehq.com/@spond/intrahost-dashboard.

Analyses of Selection in Sarbecoviruses Related to SARS-CoV-2
The whole-genome sequences of 167 members of the Sarbecovirus subgenus (including SARS-CoV and SARS-CoV-2 Wuhan-Hu-1; see https://docs.google.com/ spreadsheets/d/1sSt7fRiBYeW9z5Amj1_ OywHhfxCnZ2wqo9gnLKsq74c/edit? usp=sharing for the full list of accession numbers) were aligned using MAFFT (with the local pair option (Katoh et al. 2017)). GARD (Kosakovsky Pond et al. 2006) was used on the wholegenome alignment to determine 26 recombination breakpoints based on which individual gene codon alignments were separated. Phylogenies for the resulting putatively nonrecombinant codon alignments were reconstructed using IQTREE2 (Nguyen et al. 2015) (GTR + I+F + G4 model) and selection signals specific to the nCoV clade branches were inferred using the FEL (Kosakovsky Pond and Frost 2005) and MEME (Murrell et al. 2015) methods as in MacLean et al. (2021). Results of these analyses for all gene regions can be explored using the observable notebook at https://observablehq.com/@spond/ncosevolution-nov-2021.
Analyses of Selection in the BA.1 Sublineage Because the codon-based selection analyses that we performed gain no power from including identical sequences, and minimal power from including sequences that are essentially identical, we filtered BA.1 and reference (GISAID) sequences using pairwise genetic distances complete linkage clustering with the tn93-cluster tool (https://github.com/veg/tn93). All groups of sequences that were within D genetic distance (Tamura-Nei 93) of every other sequence in the group were represented by a single (randomly chosen) sequence in the group. We set D at 0.0001 for lineagespecific sequence sets, and at 0.0015 for GISAID reference (or "background") sequence sets. We restricted the reference set of sequences to those sampled before October 15, 2020.
We inferred a maximum likelihood tree from the combined sequence data set using raxml-ng using default settings (GTR + G model, 20 starting trees). We partitioned internal branches in the resulting tree into two nonoverlapping sets used for testing and annotated the Newick tree. Because of a lack of phylogenetic resolution in some of the segments/genes, not all analyses were possible for all segments/genes. In particular, this is true when lineage BA.1 sequences were not monophyletic in a specific region, and no internal branches could be labeled as belonging to the focal lineage.
We used HyPhy v2.5.34 (http://www.hyphy.org/) (Kosakovsky Pond et al. 2020) to perform a series of selection analyses. Analyses in this setting need to account for a well-known feature of viral evolution (Poon et al. 2007) where terminal branches include "dead-end" (maladaptive or deleterious on the population level) (Kosakovsky Pond et al. 2020) mutation events within individual hosts which have not been "seen" by natural selection, whereas internal branches must include at least one transmission event. However, because our tree is reduced to only include unique haplotypes, even leaf nodes could represent "transmission" events, if the same leaf haplotype was sampled more than once (and the vast majority were). The branches leading to these repeatedly sampled haplotypes were, therefore, also included in the analyses.
We performed an additional analysis on BA.1 sequences, which includes data available in GISAID up to January 5, 2022. The workflow for intrahost gene analysis is as follows (code available at https://github.com/veg/omicronselection; note the scripts require the GISAID FASTA files and are not robust to changes in input format).
1) Obtain GISAID sequences annotated as BA.1. 2) Map them to the reference S-gene using bealign (part of the BioExt Python package). bealign -r CoV2-S input.fasta output.bam; bam2msa output.bam S.mapped.fasta 3) Identify all sequences that are identical up to ambiguous nucleotides using tn93-cluster (these are the unique haplotypes). tn93-cluster -f -t 0.0 S.mapped.fasta . S.clusters. 0.json; python3 python/clusterprocessor.py S.clusters.0.json . S.haplo.fasta 4) Reduce the set of unique haplotypes to clusters of sequences that are all within 0.002 genetic distance of one another (tn93-cluster -f -t 0.002 S.haplo.fasta . S.clusters.1.json; python3 python/cluster-processor.py S.clusters.1.json . S.uniq.fasta) Selection Analysis Identifies Clusters of Unusual Mutational Changes · https://doi.org/10.1093/molbev/msac061 MBE 5) Identify and remove all sequences that are 0.0075 subs/site away from the "main" clusters (outliers/ low quality sequences which result in long tree branches, or are possibly misclassified) 6) For each remaining sequence cluster, build a majority consensus sequence using resolved nucleotides (assuming there is at least 3). Remove clusters that comprise fewer than three sequences. Add reference sequences for BA.2 and BA.3 to add in tree rooting. 7) Building an ML phylogeny using raxml-ng.
Annotate BA.1 internal branches. 8) Gene-level tests for selection on the internal branches of the BA.1 clades using BUSTED (Murrell et al. 2015) with synonymous rate variation enabled. 9) Codon site-level tests for episodic diversifying (MEME; Murrell et al. 2015) and pervasive positive or negative selection (FEL; Kosakovsky Pond and Frost 2005) on the internal branches of the BA.1 clade. 10) Epistasis/coevolution inference on substitutions along internal branches of the BA.1 clade using Bayesian Graphical models (Poon et al. 2007). 11) We combined all the results using a Python script and visualized results using several open-source libraries in ObservableHQ (https://observablehq. com/@spond/ba1-selection).

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.

Data Availability
The data underlying the part of this study concerning analyses of compiled SARS-CoV-2 genome sequences are available from GISAID and cannot be redistributed by the authors. The data underlying the remainder of the study (sequence read archive data, sarvecovirus genome sequence data and deep mutational scanning data) will be shared on reasonable request to the corresponding author.